We show that in problems of authorship attribution and other linguistic applications, a
Markov chain approach is a more attractive technique than Lempel-Ziv based compression.

PACS numbers: 89.70.+c, 01.20.+x, 05.20.-y, 05.45.Tp


We wish to point out a number of inaccurate and misleading
statements that Benedetto et al. make in their paper
"Language Trees and Zipping" [1]. First, they claim that the
technique they used to construct a language tree does not
make use of any a priori information about the alphabet, but
it does, both in the alphabet chosen (Unicode) and in the set
of languages they chose to experiment with; second, they
propound Lempel-Ziv (LZ, gzip) compression as being
applicable to DNA analysis, where the usefulness of LZ is
quite doubtful; third, in practice their definitions of
relative entropy and distance can yield negative values;
fourth, the classification performance of the method they use
is significantly worse than that of other entropy-based
methods, as has been noted in prior work; and fifth, the
classification speed is significantly worse as well, which
shows that its "potentiality" is questionable. We elaborate
on each of these points in the subsequent paragraphs.

Notice that the "Language Tree" (LT) diagram [1] does not
include the Russian language (Slavic branch of the
Indo-European family of languages; 288 million speakers). Our
computations show that once Russian is included, it does not
cluster with the other members of the Slavic group.
Obviously, certain languages written in the Cyrillic alphabet
were left out of the study [1], which "improves" the results
significantly and shows that a priori information about the
alphabet is being taken advantage of to achieve the results
outlined in paper [1].

The LZ compressor makes few assumptions about the input
string, but in practice we do have a priori information that
we can take advantage of. Biologists widely use an amino acid
substitution matrix (PAM250 or BLOSUM62) when searching for
similar biological sequences [2]. It is not at all clear how
a substitution matrix could be incorporated into the LZ
algorithm. That is why compression is not widely used for DNA
analysis, although the first attempts to apply it date back
to 1990 [2].

The quantity S_AB, defined as "relative entropy" in (1) and
redefined as "distance" in (2), can take negative values.
Negative values did indeed appear in our study, which showed
that the "LT" [1] significantly reflects the structure of
Unicode (or vice versa), so its relevance to language
classification needs additional support.
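The sign problem can be explored with a minimal sketch of a zipping-based estimate. Everything specific here is an assumption made for illustration: the estimate is taken to have the form S_AB = (D_Ab - D_Bb) / |b|, where D_Xb = L(X + b) - L(X) and L is a compressed length; Python's zlib (DEFLATE, the algorithm underlying gzip) stands in for the gzip program; and the function names are hypothetical.

```python
import zlib


def zipped_len(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed input (zlib stands in for gzip)."""
    return len(zlib.compress(data, 9))


def delta(context: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed to encode `probe` after `context`."""
    return zipped_len(context + probe) - zipped_len(context)


def relative_entropy(a: bytes, b_context: bytes, b_probe: bytes) -> float:
    """Assumed form of the estimate: (D_Ab - D_Bb) / |b|.

    `b_probe` is a short fragment from source B, `b_context` is a long
    text from B, and `a` is a long text from source A.
    """
    return (delta(a, b_probe) - delta(b_context, b_probe)) / len(b_probe)


if __name__ == "__main__":
    text_a = b"the quick brown fox jumps over the lazy dog " * 40
    text_b = b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 40
    probe_b = b"sed do eiusmod tempor incididunt ut labore et dolore"
    print(relative_entropy(text_a, text_b, probe_b))
```

Nothing in this difference of compressed lengths constrains its sign: whenever the probe happens to compress better after the text from A than after the text from B, the result is negative.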


A traditional definition and estimates for (relative) entropy
via an nth-order Markov chain on letters [3, 4] always lead
to a proper positive number. Markov chains are also
traditional in text entropy analysis [3, 4], compression [6],
and authorship and subject attribution [7, 8].
In [5], the classification performance of compression
programs was compared with the Markov chain approach:
82 authors of sufficiently large texts (> 10^5 characters)
were chosen. Afterwards, 82 texts, one per author, were held
out and used for control purposes. The classification
algorithm [5] had to determine the author of each control
text among 82 alternatives. The corresponding numbers of
exact guesses for 15 compression programs and Markov chains
are presented in the following list [5]:

Program (number of guesses): 7zip (39), arj (46),
bsa (44), compress (12), dmc (36), gzip (50), ha (47),
huff (10), lzari (17), ppmd5 (46), rar (58), rarw (71), rk (52);
Markov chain approach (see [8]): 69 guesses.

Clearly, gzip is significantly outperformed by other
compression algorithms and by the first-order Markov chain
model [8]. Notice also that in practical implementations the
gzip-based approach is significantly slower than the
first-order Markov chain method.
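As an illustration of the kind of model being compared, the following is a minimal sketch of attribution with a first-order Markov chain on characters. It is not the exact estimator of [8]: the add-one smoothing, the `alphabet_size` parameter, and the function names are assumptions introduced here for the example.

```python
import math
from collections import Counter


def train(text: str):
    """Count character bigrams and the number of times each character is followed by another."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    return pair_counts, first_counts


def cross_entropy(model, text: str, alphabet_size: int = 256) -> float:
    """Average -log2 p(current | previous) of `text` under `model`, in bits per character.

    Add-one smoothing (an assumption of this sketch) keeps every probability
    strictly between 0 and 1, so each term of the sum is positive.
    """
    pair_counts, first_counts = model
    total = 0.0
    for prev, cur in zip(text, text[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (first_counts[prev] + alphabet_size)
        total -= math.log2(p)
    return total / max(len(text) - 1, 1)


def attribute(models: dict, text: str) -> str:
    """Return the author whose model gives `text` the lowest cross-entropy."""
    return min(models, key=lambda author: cross_entropy(models[author], text))


if __name__ == "__main__":
    models = {
        "author_x": train("abababababab"),
        "author_y": train("cdcdcdcdcdcd"),
    }
    print(attribute(models, "ababab"))
```

Training is a single pass over each author's text and scoring is a single pass over the control text with table lookups, whereas a compression-based classifier has to recompress each author's corpus together with every control text.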

To sum up, in natural language processing (and, perhaps, in
other fields) the nth-order Markov chain models [5, 8] are
more appropriate than an LZ approach [1].


* Electronic address: D.Khmelev@newton.cam.ac.uk
† Electronic address: wjt@informatics.bangor.ac.uk
[1] D. Benedetto, E. Caglioti, and V. Loreto, Phys. Rev. Lett. 88 (2002).
[2] D. Gusfield, Algorithms on Strings, Trees, and Sequences (Cambridge University Press, Cambridge, 1997).
[3] C. Shannon, Bell Syst. Tech. J. 27 (1948).
[4] A. Yaglom and I. Yaglom, Probability and Information (Reidel, Boston, 1983).
[5] O. Kukushkina, A. Polikarpov, and D. Khmelev, Problems of Information Transmission 37, 172 (2001).
[6] J. G. Cleary and I. H. Witten, IEEE Trans. on Commun. 22, 541 (1984).
[7] W. J. Teahan, Proceedings RIAO'2000 2, 943 (2000).
[8] D. Khmelev, J. of Quantitative Linguistics 7, 201 (2000).


